R Installation

First, download and install R from https://cran.r-project.org/.

Second, download and install RStudio Desktop (Open Source License) from https://www.rstudio.com/products/rstudio/download/ (scroll down the web page to see download links).

R is a high-level programming language and environment for statistical computing. RStudio is a very useful and commonly used development interface built on top of R. Both are free. Both are available on Windows, Mac OS X, and Ubuntu. This tutorial uses R v3.4.2 and RStudio Desktop v1.1.383, but you can use more recent versions if available.

If you are having trouble with installation, try to Google the errors you are getting - very likely someone had similar issues to yours and there is a solution online. If you are still having trouble, reach out to the creators of the software.

After installation, both R and RStudio will be available in your installation directory. However, you do not need to use R directly yourself, RStudio uses it on your behalf, and you only directly interact with RStudio.

Now, proceed and launch RStudio.

First steps

Once you have launched RStudio, you will see the following interface.

Create a project: File > New Project > Existing Directory > Browse … > Create Project (the directory you pick will be your working directory)

Now, open new R script file: File > New File > R script

Understand the layout

  1. Top left corner - text editor - you type all your code here and can then run selected code using Cmd + Enter (Mac) or Ctrl + Enter (Win).
  2. Bottom left corner - console - this is where code is actually run and where output (including error messages) is shown.
  3. Top right corner - list of defined variables and loaded data.
  4. Bottom right corner - variety of windows. Plots will appear here under “Plots” tab. You can install packages and view already installed ones under the “Packages” tab. And you can type in names of functions under “Help” tab to find explanation of their operation.

Alternatively, we could use special R notebook (R markdown file) to mix code with text.

You could open it as follows: File > New File > R Notebook

R notebook allows you to mix together LaTeX equations, markdown notation, and R code, which is very useful when creating reports. In fact, this tutorial is written using R notebook format. The only peculiarity is that you need to mark R code in a special way as shown in the example. Then you can compile the notebook into one of several file formats and the code output will be embedded in text.

R basics

Below I provide a review of some key functionality. However, there is much more to R, and there are many useful commands contained within specific packages. Use Google and stackoverflow to your advantage - R has great community - if there is a question you have - it was likely asked by someone and there is a response online. Learning how to find such information is an important skill for an R user.

# hello world
print("Hello world")
> [1] "Hello world"
# variable assignments (note, R convention for assignment is '<-'; but '=' will
# also work) also note that by default the numbers are class 'numeric'
a <- 3/2
a
> [1] 1.5
class(a)  # tells us the class of the object
> [1] "numeric"
# this is how you would create an integer -- rarely used
b <- as.integer(3/2)
b
> [1] 1
class(b)
> [1] "integer"
# what if we round a number to the lowest integer?
k <- floor(3/2)
k
> [1] 1
class(k)
> [1] "numeric"
# and now to the highest integer
ceiling(3/2)
> [1] 2
round(3/2)
> [1] 2
# class character
class("s")
> [1] "character"
# turning numeric to character
as.character(3)
> [1] "3"
# and back
as.numeric("3")
> [1] 3
# creating a vector of numbers
d <- c(1, 2, 4, 5)
d
> [1] 1 2 4 5
# notice, vector of numeric is class numeric too
class(d)
> [1] "numeric"
# creating a vector of strings
k <- c(1, "dog", 2, "a", 4, 6)
k
> [1] "1"   "dog" "2"   "a"   "4"   "6"
class(k)
> [1] "character"
# converting character vector to numeric - NA missing values introduced
as.numeric(k)
> Warning: NAs introduced by coercion
> [1]  1 NA  2 NA  4  6
# checking for NA
is.na(as.numeric(k))
> Warning: NAs introduced by coercion
> [1] FALSE  TRUE FALSE  TRUE FALSE FALSE
# logical values
k <- c(TRUE, FALSE, T, F)
k
> [1]  TRUE FALSE  TRUE FALSE
class(k)
> [1] "logical"
# sequences
1:10
>  [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10)
>  [1]  1  2  3  4  5  6  7  8  9 10
seq(1, 10, 2)  # odd numbers
> [1] 1 3 5 7 9
# first element in a vector
d[1]
> [1] 1
# second element in a vector
d[2]
> [1] 2
# two ways to get the last element in a vector
d[length(d)]
> [1] 5
tail(d, 1)
> [1] 5
# repeat vector three times
rep(d, 3)
>  [1] 1 2 4 5 1 2 4 5 1 2 4 5
# repeat every element of a vector three times
rep(d, each = 3)
>  [1] 1 1 1 2 2 2 4 4 4 5 5 5
# get unique elements
unique(rep(d, each = 3))
> [1] 1 2 4 5
sum(d)  # sum of elements
> [1] 12
prod(d)  # product of elements
> [1] 40
mean(d)  # mean
> [1] 3
sd(d)  # standard deviation
> [1] 1.825742
var(d)  # variance
> [1] 3.333333
sqrt(var(d))  # standard deviation
> [1] 1.825742
# log
log(d)  # base e, also known as ln
> [1] 0.0000000 0.6931472 1.3862944 1.6094379
log10(d)  # base 10
> [1] 0.00000 0.30103 0.60206 0.69897
# max value and index
max(d)
> [1] 5
which.max(d)
> [1] 4
# exp
exp(d)
> [1]   2.718282   7.389056  54.598150 148.413159
# taking to power
d^2
> [1]  1  4 16 25
# applying computation to a vector
sapply(d, function(x) x^2)
> [1]  1  4 16 25
# factor - vector that assumes a discrete set of values (levels)
v <- rep(c("1990", "2000", "2010"), 100)
v
>   [1] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
>  [11] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
>  [21] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
>  [31] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
>  [41] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
>  [51] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
>  [61] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
>  [71] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
>  [81] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
>  [91] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [101] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [111] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
> [121] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [131] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [141] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
> [151] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [161] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [171] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
> [181] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [191] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [201] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
> [211] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [221] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [231] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
> [241] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [251] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [261] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
> [271] "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990"
> [281] "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000"
> [291] "2010" "1990" "2000" "2010" "1990" "2000" "2010" "1990" "2000" "2010"
v <- factor(v)
v
>   [1] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
>  [15] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
>  [29] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
>  [43] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
>  [57] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
>  [71] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
>  [85] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
>  [99] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [113] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [127] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [141] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [155] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [169] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [183] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [197] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [211] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [225] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [239] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [253] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [267] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [281] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [295] 1990 2000 2010 1990 2000 2010
> Levels: 1990 2000 2010
levels(v)  # levels of a factor
> [1] "1990" "2000" "2010"
as.numeric(v)  # reveals numeric index corresponding to level ordering
>   [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
>  [36] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
>  [71] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> [106] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
> [141] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
> [176] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> [211] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
> [246] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
> [281] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
as.numeric(as.character(v))  # recovers level labels
>   [1] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
>  [15] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
>  [29] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
>  [43] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
>  [57] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
>  [71] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
>  [85] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
>  [99] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [113] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [127] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [141] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [155] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [169] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [183] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [197] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [211] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [225] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [239] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [253] 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000
> [267] 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990
> [281] 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010 1990 2000 2010
> [295] 1990 2000 2010 1990 2000 2010

Random numbers, matrices, and data frames

# matrix of zeros
matrix(0, 4, 4)
>      [,1] [,2] [,3] [,4]
> [1,]    0    0    0    0
> [2,]    0    0    0    0
> [3,]    0    0    0    0
> [4,]    0    0    0    0
# diagonal identity matrix
diag(5)
>      [,1] [,2] [,3] [,4] [,5]
> [1,]    1    0    0    0    0
> [2,]    0    1    0    0    0
> [3,]    0    0    1    0    0
> [4,]    0    0    0    1    0
> [5,]    0    0    0    0    1
# matrix of zeros
m <- matrix(0, 1, 4)
v <- rep(0, 5)

class(m)
> [1] "matrix"
class(v)
> [1] "numeric"
# conversion between matrices and vectors
as.matrix(v)
>      [,1]
> [1,]    0
> [2,]    0
> [3,]    0
> [4,]    0
> [5,]    0
as.vector(m)
> [1] 0 0 0 0
# set seed
set.seed(999)

# generate uniform [0,1] random variables
runif(5)
> [1] 0.38907138 0.58306072 0.09466569 0.85263123 0.78674676
# generate standard normal random variables
rnorm(5)
> [1] -1.1782812 -1.3986638  0.3040954  0.1284985  2.2420250
# create a 12-number vector of uniform[0,1] random numbers and reshape it into
# 4x3 matrix
v <- runif(12)
v
>  [1] 0.78721294 0.16658469 0.82125595 0.13114195 0.06998445 0.90749130
>  [7] 0.63380102 0.55328977 0.34208638 0.82607030 0.81951407 0.56849274
class(v)
> [1] "numeric"
matrix(v, 4, 3)
>           [,1]       [,2]      [,3]
> [1,] 0.7872129 0.06998445 0.3420864
> [2,] 0.1665847 0.90749130 0.8260703
> [3,] 0.8212559 0.63380102 0.8195141
> [4,] 0.1311419 0.55328977 0.5684927
# or filling by row
m <- matrix(v, 4, 3, byrow = TRUE)
m
>           [,1]       [,2]      [,3]
> [1,] 0.7872129 0.16658469 0.8212559
> [2,] 0.1311419 0.06998445 0.9074913
> [3,] 0.6338010 0.55328977 0.3420864
> [4,] 0.8260703 0.81951407 0.5684927
class(m)
> [1] "matrix"
# mean of a matrix
mean(m)
> [1] 0.5522438
# matrix transpose
t(m)
>           [,1]       [,2]      [,3]      [,4]
> [1,] 0.7872129 0.13114195 0.6338010 0.8260703
> [2,] 0.1665847 0.06998445 0.5532898 0.8195141
> [3,] 0.8212559 0.90749130 0.3420864 0.5684927
# matrix dot product
md <- m %*% t(m)
md
>           [,1]      [,2]      [,3]      [,4]
> [1,] 1.3219160 0.8601776 0.8720464 1.2536898
> [2,] 0.8601776 0.8456365 0.4322800 0.6815879
> [3,] 0.8720464 0.4322800 0.8248564 1.1714666
> [4,] 1.2536898 0.6815879 1.1714666 1.6771795
# matrix element-wise product
m * m
>            [,1]        [,2]      [,3]
> [1,] 0.61970422 0.027750460 0.6744613
> [2,] 0.01719821 0.004897823 0.8235405
> [3,] 0.40170373 0.306129572 0.1170231
> [4,] 0.68239214 0.671603316 0.3231840
# multiplication by a scalar
10 * m
>          [,1]      [,2]     [,3]
> [1,] 7.872129 1.6658469 8.212559
> [2,] 1.311419 0.6998445 9.074913
> [3,] 6.338010 5.5328977 3.420864
> [4,] 8.260703 8.1951407 5.684927
# size of a matrix
dim(m)
> [1] 4 3
# number of rows
dim(m)[1]
> [1] 4
# number of columns
dim(m)[2]
> [1] 3
# mean of each row
apply(m, 1, mean)
> [1] 0.5916845 0.3695392 0.5097257 0.7380257
# mean of each column
apply(m, 2, mean)
> [1] 0.5945566 0.4023432 0.6598316
# sum of each row
rowSums(m)
> [1] 1.775054 1.108618 1.529177 2.214077
apply(m, 1, sum)
> [1] 1.775054 1.108618 1.529177 2.214077
# sum of each column
colSums(m)
> [1] 2.378226 1.609373 2.639326
apply(m, 2, sum)
> [1] 2.378226 1.609373 2.639326
# first row of a matrix
m[1, ]
> [1] 0.7872129 0.1665847 0.8212559
# second column of a matrix
m[, 2]
> [1] 0.16658469 0.06998445 0.55328977 0.81951407
# specific element of a matrix
m[1, 2]
> [1] 0.1665847
# matrix inverse
solve(m[1:3, 1:3])
>            [,1]       [,2]       [,3]
> [1,]  1.8049785 -1.5001301 -0.3536953
> [2,] -2.0018047  0.9482986  2.2901270
> [3,] -0.1064618  1.2455919 -0.1254987

Lists

# list of objects - unnamed
l1 <- list(5, "b", 12)
l1
> [[1]]
> [1] 5
> 
> [[2]]
> [1] "b"
> 
> [[3]]
> [1] 12
# getting back a vector of elements in a list
unlist(l1)
> [1] "5"  "b"  "12"
# accessing element in a list - note special format
l1[[1]]
> [1] 5
# named list
l2 <- list(a1 = 5, a2 = c(2983, 1890), a3 = c(3, 118))
l2
> $a1
> [1] 5
> 
> $a2
> [1] 2983 1890
> 
> $a3
> [1]   3 118
# getting named list value
l2$a1
> [1] 5
# applying power to each element of a list
lapply(l2, function(x) x^2)
> $a1
> [1] 25
> 
> $a2
> [1] 8898289 3572100
> 
> $a3
> [1]     9 13924
# unlisting named list
unlist(l2)
>   a1  a21  a22  a31  a32 
>    5 2983 1890    3  118
# taking mean over a list
mean(unlist(l2))
> [1] 999.8

Data frames

# creating a data frame - list of vectors of the same length, it is structured
# like a table or matrix, individual variables are accessed via column names,
# individual observations - via row index
b1 <- c(NA, 22, 50)
b2 <- c("john", "rose", "john")
b3 <- c(TRUE, FALSE, TRUE)
df <- data.frame(b1, b2, b3)

# here is how we access columns
df$b1
> [1] NA 22 50
df$b2
> [1] john rose john
> Levels: john rose
# note that a vector of strings was automatically converted into a factor when
# data frame was created
class(df$b2)
> [1] "factor"
# if we do not want to convert string vectors to factors when creating a data
# frame, we simply add stringsAsFactors=FALSE option
df <- data.frame(b1, b2, b3, stringsAsFactors = FALSE)

class(df$b2)  # now it is character
> [1] "character"
# renaming columns (variables) of a data frame
colnames(df)
> [1] "b1" "b2" "b3"
colnames(df) <- c("age", "name", "indicator")
colnames(df)
> [1] "age"       "name"      "indicator"
# second row
df[2, ]
>   age name indicator
> 2  22 rose     FALSE
# second element of the indicator variable
df$indicator[2]
> [1] FALSE
df[2, "indicator"]
> [1] FALSE
df[2, 3]
> [1] FALSE
# slicing - two first rows and two names columns
df[1:2, c("age", "indicator")]
>   age indicator
> 1  NA      TRUE
> 2  22     FALSE
# picking out elements of data frame based on two vectors of coordinates
df[cbind(1:2, c(1, 3))]
> [1] NA      "FALSE"
# filtering for rows that contain name 'john' in name column
df[df$name == "john", ]
>   age name indicator
> 1  NA john      TRUE
> 3  50 john      TRUE
# keeping rows without missing values in age column
df[!is.na(df$age), ]
>   age name indicator
> 2  22 rose     FALSE
> 3  50 john      TRUE
# binding data frames together
rbind(df, df)  # horizontally
>   age name indicator
> 1  NA john      TRUE
> 2  22 rose     FALSE
> 3  50 john      TRUE
> 4  NA john      TRUE
> 5  22 rose     FALSE
> 6  50 john      TRUE
cbind(df, df)  # vertically
>   age name indicator age name indicator
> 1  NA john      TRUE  NA john      TRUE
> 2  22 rose     FALSE  22 rose     FALSE
> 3  50 john      TRUE  50 john      TRUE
# description statistics
df <- data.frame(a = runif(10), b = runif(10), c = runif(10))

# mean
apply(df, 2, mean)
>         a         b         c 
> 0.5557822 0.4620113 0.3673768
# variance
apply(df, 2, var)
>          a          b          c 
> 0.06514680 0.07555668 0.09733104
# standard deviation
apply(df, 2, sd)
>         a         b         c 
> 0.2552387 0.2748758 0.3119792

Loops, if-else statements, functions, vectorization

# loop
for (i in 1:5) {
    print(i)
}
> [1] 1
> [1] 2
> [1] 3
> [1] 4
> [1] 5
# conditional statements
if (a > 1) {
    print("hi")
} else if (a > 0) {
    print("hello")
} else {
    print("bye")
}
> [1] "hi"
# function
custom_mean <- function(v) {
    # taking mean over a vector via a for-loop
    temp <- 0
    for (i in v) {
        temp <- temp + i
    }
    return(temp/length(v))
}

v <- runif(1e+06)

custom_mean(v)
> [1] 0.5001688
mean(v)
> [1] 0.5001688
# which one is faster?
s <- Sys.time()
custom_mean(v)
> [1] 0.5001688
e <- Sys.time()
e - s
> Time difference of 0.02652884 secs
s <- Sys.time()
mean(v)
> [1] 0.5001688
e <- Sys.time()
e - s
> Time difference of 0.00444603 secs
# as we see, native mean() function, which applies to an array as a whole, is
# much faster - loops, which access arrays elementwise, are very slow in R -
# avoid them at all costs!  using mean() function instead of looping over an
# array is an example of vectorization - vectorize everything using R expressions
# - e.g., apply(), linear algebra notation, etc.

# vectorization is fast because R functions are often just an interface to fast
# low-level language code - for example, fast fourier transform calls C code (we
# can print content of the function by typing it without round brackets)
fft
> function (z, inverse = FALSE) 
> .Call(C_fft, z, inverse)
> <bytecode: 0x7fa9d82f5b90>
> <environment: namespace:stats>
# and here is how you can get help on the function
`?`(fft)

Loading packages

Note, you need to install the following packages before we can load and use them:

One way to do it is to go into Packages tab in the bottom right window of RStudio, click install, enter package name, and press enter (you can also update packages from this tab). Alternatively, you can use install.packages(c(“gdata”,“ggplot2”,“ggthemes”,“scales”,“arm”)) command.

# Note, you need to install packages below before loading them
# install.packages(c('gdata','ggplot2','ggthemes','scales','arm'))

suppressMessages(library("gdata"))  # loads library and suppresses unnecessary output
library("ggplot2")
library("ggthemes")
library("scales")
suppressMessages(library("arm"))

Downloading and loading data

# the project is based on rolling sales update provided by NYC
# http://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page

# data for Manhattan in excel format is available via this link
# http://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_manhattan.xls

# download the data (5.3MB)
download.file("http://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/rollingsales_manhattan.xls", 
    destfile = "manhattan_real_estate.xls")

# load xsl file into a data frame - this will take a moment
data <- read.xls("manhattan_real_estate.xls", header = TRUE, pattern = "BOROUGH")

# do ?read.xls and you will find that 'pattern' requires that all lines are
# skipped before the requested pattern is found

Data cleaning

# data frame
class(data)
> [1] "data.frame"
# size of data
dim(data)
> [1] 18066    21
# names of columns - could be overwritten if desired
colnames(data)
>  [1] "BOROUGH"                        "NEIGHBORHOOD"                  
>  [3] "BUILDING.CLASS.CATEGORY"        "TAX.CLASS.AT.PRESENT"          
>  [5] "BLOCK"                          "LOT"                           
>  [7] "EASE.MENT"                      "BUILDING.CLASS.AT.PRESENT"     
>  [9] "ADDRESS"                        "APARTMENT.NUMBER"              
> [11] "ZIP.CODE"                       "RESIDENTIAL.UNITS"             
> [13] "COMMERCIAL.UNITS"               "TOTAL.UNITS"                   
> [15] "LAND.SQUARE.FEET"               "GROSS.SQUARE.FEET"             
> [17] "YEAR.BUILT"                     "TAX.CLASS.AT.TIME.OF.SALE"     
> [19] "BUILDING.CLASS.AT.TIME.OF.SALE" "SALE.PRICE"                    
> [21] "SALE.DATE"
# see top 5 rows of the data
head(data, 5)
>   BOROUGH              NEIGHBORHOOD
> 1       1 ALPHABET CITY            
> 2       1 ALPHABET CITY            
> 3       1 ALPHABET CITY            
> 4       1 ALPHABET CITY            
> 5       1 ALPHABET CITY            
>                        BUILDING.CLASS.CATEGORY TAX.CLASS.AT.PRESENT BLOCK
> 1 03  THREE FAMILY DWELLINGS                                      1   376
> 2 07  RENTALS - WALKUP APARTMENTS                                 2   375
> 3 07  RENTALS - WALKUP APARTMENTS                                 2   385
> 4 07  RENTALS - WALKUP APARTMENTS                                2A   392
> 5 07  RENTALS - WALKUP APARTMENTS                                2A   392
>   LOT EASE.MENT BUILDING.CLASS.AT.PRESENT
> 1  24        NA                        C0
> 2  28        NA                        C4
> 3  36        NA                        C7
> 4   5        NA                        C2
> 5   6        NA                        C2
>                                     ADDRESS APARTMENT.NUMBER ZIP.CODE
> 1 264 EAST 7TH   STREET                                         10009
> 2 738 EAST 6TH STREET                                           10009
> 3 27 AVENUE C                                                   10009
> 4 151 AVENUE B                                                  10009
> 5 153 AVENUE B                                                  10009
>   RESIDENTIAL.UNITS COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET
> 1                 3                0           3            2,059
> 2                11                0          11            1,750
> 3                24                1          25            2,650
> 4                 5                0           5            2,139
> 5                 5                0           5            1,633
>   GROSS.SQUARE.FEET YEAR.BUILT TAX.CLASS.AT.TIME.OF.SALE
> 1             3,696       1900                         1
> 2             6,500       1900                         2
> 3             9,960       1910                         2
> 4             4,416       1900                         2
> 5             6,440       1900                         2
>   BUILDING.CLASS.AT.TIME.OF.SALE SALE.PRICE  SALE.DATE
> 1                             C0  7,738,700 2016-12-22
> 2                             C4  3,750,000 2017-04-03
> 3                             C7  5,235,000 2017-07-11
> 4                             C2          0 2017-06-16
> 5                             C2  6,625,000 2017-07-19
# what class?
class(data$SALE.PRICE)
> [1] "factor"
class(data$YEAR.BUILT)
> [1] "factor"
# transform factor into numeric
data$YEAR.BUILT <- as.numeric(as.character(data$YEAR.BUILT))
> Warning: NAs introduced by coercion
# transform factor into numeric, removing commas on the way
data$SALE.PRICE <- as.numeric(gsub(",", "", as.character(data$SALE.PRICE)))

# transform into factor to make it a categorical variable
data$TAX.CLASS.AT.TIME.OF.SALE <- as.factor(data$TAX.CLASS.AT.TIME.OF.SALE)

data <- data[!is.na(data$YEAR.BUILT), ]  # remove observations with a missing year of construction
data <- data[data$YEAR.BUILT >= 1900, ]  # only keep observations after 1900
data <- data[data$SALE.PRICE > 0, ]  # filter out zero prices (e.g., due to inheritance/gifts)

# check size of the data again
dim(data)
> [1] 12476    21
# we could now save the data, dropping row index
write.csv(data, file = "clean_data.csv", row.names = FALSE)

# and here is how we could read it back in
head(read.csv(file = "clean_data.csv", head = TRUE))
>   BOROUGH              NEIGHBORHOOD
> 1       1 ALPHABET CITY            
> 2       1 ALPHABET CITY            
> 3       1 ALPHABET CITY            
> 4       1 ALPHABET CITY            
> 5       1 ALPHABET CITY            
> 6       1 ALPHABET CITY            
>                        BUILDING.CLASS.CATEGORY TAX.CLASS.AT.PRESENT BLOCK
> 1 03  THREE FAMILY DWELLINGS                                      1   376
> 2 07  RENTALS - WALKUP APARTMENTS                                 2   375
> 3 07  RENTALS - WALKUP APARTMENTS                                 2   385
> 4 07  RENTALS - WALKUP APARTMENTS                                2A   392
> 5 07  RENTALS - WALKUP APARTMENTS                                2A   404
> 6 08  RENTALS - ELEVATOR APARTMENTS                               2   387
>   LOT EASE.MENT BUILDING.CLASS.AT.PRESENT
> 1  24        NA                        C0
> 2  28        NA                        C4
> 3  36        NA                        C7
> 4   6        NA                        C2
> 5  55        NA                        C2
> 6 153        NA                        D9
>                                     ADDRESS APARTMENT.NUMBER ZIP.CODE
> 1 264 EAST 7TH   STREET                                         10009
> 2 738 EAST 6TH STREET                                           10009
> 3 27 AVENUE C                                                   10009
> 4 153 AVENUE B                                                  10009
> 5 301 EAST 10TH   STREET                                        10009
> 6 629 EAST 5TH STREET                                           10009
>   RESIDENTIAL.UNITS COMMERCIAL.UNITS TOTAL.UNITS LAND.SQUARE.FEET
> 1                 3                0           3            2,059
> 2                11                0          11            1,750
> 3                24                1          25            2,650
> 4                 5                0           5            1,633
> 5                 6                0           6            2,369
> 6                24                0          24            4,489
>   GROSS.SQUARE.FEET YEAR.BUILT TAX.CLASS.AT.TIME.OF.SALE
> 1             3,696       1900                         1
> 2             6,500       1900                         2
> 3             9,960       1910                         2
> 4             6,440       1900                         2
> 5             4,615       1900                         2
> 6            18,523       1920                         2
>   BUILDING.CLASS.AT.TIME.OF.SALE SALE.PRICE  SALE.DATE
> 1                             C0    7738700 2016-12-22
> 2                             C4    3750000 2017-04-03
> 3                             C7    5235000 2017-07-11
> 4                             C2    6625000 2017-07-19
> 5                             C2    8000000 2016-11-17
> 6                             D9   16232000 2016-11-07

Basic distribution statistics

median(data$SALE.PRICE)  # median sales price
> [1] 1100000
mean(data$SALE.PRICE)  # mean 
> [1] 3417036
max(data$SALE.PRICE)  # max
> [1] 2.21e+09
aggregate(cbind(YEAR.BUILT, SALE.PRICE) ~ TAX.CLASS.AT.TIME.OF.SALE, data = data, 
    FUN = median)  # get median of variables by group
>   TAX.CLASS.AT.TIME.OF.SALE YEAR.BUILT SALE.PRICE
> 1                         1       1910    3922000
> 2                         2       1956    1060000
> 3                         4       1926    5350000
aggregate(cbind(YEAR.BUILT, SALE.PRICE) ~ TAX.CLASS.AT.TIME.OF.SALE, data = data, 
    FUN = length)  # count of rows within each group
>   TAX.CLASS.AT.TIME.OF.SALE YEAR.BUILT SALE.PRICE
> 1                         1        145        145
> 2                         2      11967      11967
> 3                         4        364        364
# other libraries to explore on your own - plyr, dplyr, reshape2, lubridate

Histogram

# simple sales price histogram (log-transformed)
hist(log10(data$SALE.PRICE))

# ggplot2 sales prices histogram
ggplot(data, aes(SALE.PRICE)) + geom_histogram(aes(fill = ..count..), alpha = 0.6, 
    bins = 50) + theme_bw() + ggtitle("Histogram of Property Sales, Manhattan", subtitle = NULL) + 
    scale_fill_gradient("Count", low = "orange", high = "red") + scale_y_continuous("Count") + 
    scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x), labels = scales::trans_format("log10", 
        scales::math_format(10^.x))) + xlab("Sales Price (log-scaled)")

Running a regression

# pearson (linear) correlation between logged prices and year when the building
# was constructed
cor(log10(data$SALE.PRICE), data$YEAR.BUILT)
> [1] 0.05874227
# spearman rank-order correlation (here log transformation would not change
# results because correlation is measured in terms of rank orderings, and log
# preserves order)
cor(data$SALE.PRICE, data$YEAR.BUILT, method = "spearman")
> [1] 0.09275609
# run simple regression
model <- lm(data = data, log10(SALE.PRICE) ~ YEAR.BUILT)

# plot summary of estimates
summary(model)
> 
> Call:
> lm(formula = log10(SALE.PRICE) ~ YEAR.BUILT, data = data)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -6.1407 -0.2666 -0.0473  0.2856  3.2550 
> 
> Coefficients:
>              Estimate Std. Error t value Pr(>|t|)    
> (Intercept) 3.9417833  0.3246502  12.142  < 2e-16 ***
> YEAR.BUILT  0.0010924  0.0001662   6.572 5.16e-11 ***
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.6547 on 12474 degrees of freedom
> Multiple R-squared:  0.003451,    Adjusted R-squared:  0.003371 
> F-statistic: 43.19 on 1 and 12474 DF,  p-value: 5.158e-11
# we can interpret the regression results as follows if we increase construction
# year by 1, e.g. from 2000 to 2001, we increase log10(sales price) by 0.0013421,
# that is, we increase sales price by 100*(10^0.0013421-1) = 0.31% see details on
# interpretation here: http://www.cazaar.com/ta/econ113/interpreting-beta

Plotting a regression line

# plot regression line - relationship between log10(sales price) and year built
# color points by tax class
p <- ggplot(data, aes(YEAR.BUILT, SALE.PRICE, color = TAX.CLASS.AT.TIME.OF.SALE)) + 
    geom_point(alpha = 0.3) + theme_bw() + xlab("Year of Construction") + scale_y_log10(breaks = scales::trans_breaks("log10", 
    function(x) 10^x), labels = scales::trans_format("log10", scales::math_format(10^.x)))

# add regression line setting intercept of the line based on beta0, slope of the
# line based on beta1
p + geom_abline(intercept = coef(model)[1], slope = coef(model)[2]) + # annotating
annotate(label = sprintf("log10(y) = %.5f + %.5fx\nR² = %.3f", coef(model)[1], coef(model)[2], 
    summary(model)$r.squared), geom = "text", x = 1975, y = 10^8, size = 4)

Regression diagnostics

# regression diagnostic plots - notice, for example, that qqplot shows deviations
# from assumption that residuals are normally distributed - so we should take
# care interpreting the p-values and statistical significance results
plot(model)

Plotting numerous regression coefficients

# finally, let us build a large model with dummy variables and visualize its
# coefficients and confidence intervals
model <- lm(data = data, log10(SALE.PRICE) ~ YEAR.BUILT + NEIGHBORHOOD)
names(model$coefficients) <- gsub("NEIGHBORHOOD", "", names(model$coefficients))
summary(model)
> 
> Call:
> lm(formula = log10(SALE.PRICE) ~ YEAR.BUILT + NEIGHBORHOOD, data = data)
> 
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -6.1803 -0.2381 -0.0234  0.2539  3.0311 
> 
> Coefficients:
>                             Estimate Std. Error t value Pr(>|t|)    
> (Intercept)                4.3780460  0.3419052  12.805  < 2e-16 ***
> YEAR.BUILT                 0.0008256  0.0001728   4.777 1.80e-06 ***
> CHELSEA                    0.0098571  0.0615615   0.160 0.872792    
> CHINATOWN                  0.1672816  0.1098193   1.523 0.127723    
> CIVIC CENTER               0.6763065  0.0700431   9.656  < 2e-16 ***
> CLINTON                    0.0531781  0.0684135   0.777 0.436993    
> EAST VILLAGE               0.1100625  0.0822384   1.338 0.180812    
> FASHION                    0.4212406  0.0932439   4.518 6.31e-06 ***
> FINANCIAL                  0.1674925  0.0634833   2.638 0.008341 ** 
> FLATIRON                   0.3481066  0.0655872   5.308 1.13e-07 ***
> GRAMERCY                   0.0118903  0.0644379   0.185 0.853605    
> GREENWICH VILLAGE-CENTRAL  0.1459627  0.0638650   2.285 0.022301 *  
> GREENWICH VILLAGE-WEST     0.2396697  0.0640045   3.745 0.000182 ***
> HARLEM-CENTRAL            -0.1126075  0.0623594  -1.806 0.070977 .  
> HARLEM-EAST                0.0491065  0.0717272   0.685 0.493591    
> HARLEM-UPPER              -0.3664598  0.0845821  -4.333 1.49e-05 ***
> HARLEM-WEST               -0.1931072  0.1287613  -1.500 0.133710    
> INWOOD                    -0.2599219  0.0888586  -2.925 0.003449 ** 
> JAVITS CENTER              0.2514754  0.1323487   1.900 0.057443 .  
> KIPS BAY                   0.0116579  0.0711563   0.164 0.869863    
> LITTLE ITALY               0.6759106  0.1464918   4.614 3.99e-06 ***
> LOWER EAST SIDE            0.0518516  0.0708927   0.731 0.464543    
> MANHATTAN VALLEY          -0.1557147  0.0753301  -2.067 0.038746 *  
> MIDTOWN CBD                0.3120173  0.0805102   3.876 0.000107 ***
> MIDTOWN EAST              -0.0159146  0.0590260  -0.270 0.787457    
> MIDTOWN WEST              -0.0842518  0.0626004  -1.346 0.178370    
> MORNINGSIDE HEIGHTS       -0.2046543  0.0985653  -2.076 0.037884 *  
> MURRAY HILL               -0.0553214  0.0636616  -0.869 0.384869    
> ROOSEVELT ISLAND          -0.0063774  0.0997598  -0.064 0.949029    
> SOHO                       0.2509674  0.0703925   3.565 0.000365 ***
> SOUTHBRIDGE                0.0507577  0.0826865   0.614 0.539321    
> TRIBECA                    0.4591325  0.0643516   7.135 1.02e-12 ***
> UPPER EAST SIDE (59-79)    0.0820924  0.0582267   1.410 0.158602    
> UPPER EAST SIDE (79-96)    0.1198151  0.0584239   2.051 0.040308 *  
> UPPER EAST SIDE (96-110)   0.2659099  0.1183407   2.247 0.024658 *  
> UPPER WEST SIDE (59-79)    0.1194639  0.0595128   2.007 0.044732 *  
> UPPER WEST SIDE (79-96)    0.1011770  0.0611579   1.654 0.098080 .  
> UPPER WEST SIDE (96-116)   0.0085379  0.0676184   0.126 0.899524    
> WASHINGTON HEIGHTS LOWER   0.0438792  0.0817637   0.537 0.591513    
> WASHINGTON HEIGHTS UPPER  -0.1448984  0.0691856  -2.094 0.036250 *  
> ---
> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> 
> Residual standard error: 0.6356 on 12436 degrees of freedom
> Multiple R-squared:  0.06372, Adjusted R-squared:  0.06078 
> F-statistic:  21.7 on 39 and 12436 DF,  p-value: < 2.2e-16
coefplot(model, mar = c(0.5, 20, 2, 5), main = "")  # bottom, left, top, and right margins